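Programming Massively Parallel Processors: A Practical Introduction: Hardware Bottlenecks: Memory and Resource Limits

Modern high-performance computing faces a fundamental "memory wall": computational throughput (FLOPS) has grown rapidly, while global memory bandwidth has grown only modestly. This gap turns a huge array of cores into "starved" processors that sit idle waiting for data.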

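1. The Bandwidth Gap

A GPU can perform trillions of operations per second, but the physical path to DRAM is constrained by pin density and power requirements. Memory therefore becomes the limiting factor for parallelism: as the number of threads grows, the bandwidth available to each thread falls, and the hardware spends stall cycles waiting for data.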
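The arithmetic behind this gap can be sketched in a few lines. The snippet below is a minimal illustration, not a measurement: the device figures (10,000 GFLOP/s peak, 500 GB/s bandwidth) and both helper functions are hypothetical, chosen only to show how a roofline-style bound and the per-thread share of bandwidth behave.

```python
# Hypothetical device figures, chosen for illustration only.
PEAK_GFLOPS = 10_000.0      # peak compute throughput, GFLOP/s
BANDWIDTH_GB_S = 500.0      # global memory bandwidth, GB/s


def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style bound: the lower of the compute and memory ceilings."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def per_thread_bandwidth_mb_s(bandwidth_gb_s, active_threads):
    """Even share of the memory bus per thread, in MB/s."""
    return bandwidth_gb_s * 1000.0 / active_threads


# Vector addition c[i] = a[i] + b[i] on float32 values:
# 1 FLOP per 12 bytes moved (two 4-byte loads plus one 4-byte store).
vector_add_intensity = 1 / 12

print("vector add bound (GFLOP/s):",
      attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, vector_add_intensity))

# Adding threads does not add bandwidth; each thread's share shrinks.
for threads in (10_000, 100_000, 1_000_000):
    print(threads, "threads ->",
          per_thread_bandwidth_mb_s(BANDWIDTH_GB_S, threads), "MB/s each")
```

Under these assumed figures, a kernel with so little data reuse is bound by the memory ceiling rather than the compute ceiling, which is why raising arithmetic intensity helps more than adding threads.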

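2. The Kitchen Analogy

Imagine a state-of-the-art kitchen (the GPU cores) that can cook 1,000 meals per hour. The ingredients, however, sit in a warehouse (global memory) five miles away, and there is only one delivery scooter (the memory bus). No matter how many chefs you hire, output is limited by the speed of the scooter.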

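3. Architectural Contrast

A standard multicore CPU system uses large caches to hide latency for a small number of heavyweight threads. A massively parallel architecture, by contrast, faces a constant "traffic jam" in which many requests arrive at once. Resource limits at the register and shared memory level determine the maximum achievable parallelism (occupancy) before the hardware is oversubscribed.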
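How register and shared memory budgets cap occupancy can be sketched with a simplified calculation. Everything here is an assumption for illustration: the `SMLimits` values are hypothetical and not tied to any specific GPU, and the model ignores the allocation granularity that real hardware applies to registers and shared memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Hypothetical per-SM resource limits (illustrative values only)."""
    max_threads: int = 2048
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 65536


def occupancy(threads_per_block, regs_per_thread, smem_per_block, sm=SMLimits()):
    """Fraction of the SM's thread slots that can be filled by resident blocks."""
    by_threads = sm.max_threads // threads_per_block
    by_regs = sm.registers // (regs_per_thread * threads_per_block)
    # A block that uses no shared memory is not limited by it.
    by_smem = sm.shared_mem_bytes // smem_per_block if smem_per_block else sm.max_blocks
    # The scarcest resource decides how many blocks fit on the SM.
    resident_blocks = min(sm.max_blocks, by_threads, by_regs, by_smem)
    return resident_blocks * threads_per_block / sm.max_threads


# Same block size, three different resource footprints.
print(occupancy(256, 32, 0))        # light register use, no shared memory
print(occupancy(256, 64, 0))        # doubled register use per thread
print(occupancy(256, 32, 16384))    # 16 KiB of shared memory per block
```

In this model, doubling the registers per thread or claiming a quarter of the shared memory per block halves the number of resident blocks, even though the thread count per block is unchanged.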

QUESTION 1

What is the primary cause of the 'Memory Wall' in modern GPU computing?

The clock speed of cores is too slow to process DRAM data.

Computational throughput (FLOPS) has increased much faster than memory bandwidth.

Shared memory is too large for the hardware to manage.

Global memory has higher latency than CPU registers.

QUESTION 2

In the 'Kitchen Analogy,' what does the delivery scooter represent?

The GPU Core/Chef.

The Register File.

The Global Memory Bus.

The Operating System Scheduler.

QUESTION 3

How do resource limitations like register count affect parallelism?

They increase the speed of each individual thread.

They limit occupancy by reducing the number of active threads that can reside on an SM.

They have no effect on throughput, only on power consumption.

They bypass the need for global memory access.

QUESTION 4

When a kernel is in the 'Memory Bound' region of the Roofline Model, what is the best way to improve performance?

Increase the number of floating-point operations per second.

Increase the arithmetic intensity (data reuse).

Decrease the number of threads per block.

Add more complex branching logic.

QUESTION 5

Why is implicit synchronization unreliable in massively parallel architectures?

Hardware evolution means threads within a warp may not stay locked in SIMT fashion.

Shared memory is too fast for synchronization to matter.

Global memory access is always synchronous.

Threads are processed sequentially in blocks.